In this problem set, you’ll continue to explore the diamonds data set.
Your first task is to create a scatterplot of price vs. x using the ggplot syntax.
# You need to run these commands each time you launch RStudio to access the diamonds data set. RStudio won't remember which packages and data sets you loaded unless you change your preferences or save your workspace.
setwd("~/Desktop/Nanodegrees/Data Analyst/Task 4/Data Analysis with R/Explore Two Variables/Files/")
library(ggplot2) # load the ggplot2 package first
data(diamonds) # load the diamonds data set, which ships with ggplot2
#summary(diamonds)
ggplot(aes(x, price), data = diamonds) +
  geom_point(alpha = 0.05)
The plot shows that most diamonds have a length (x) between 4 and 8 mm, and that price increases with length, with the highest-priced diamonds among the longest.
The relationship between price and x looks exponential rather than linear, and there are a few outliers that are both very large and very expensive.
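One way to probe the claim of an exponential relationship is to compare the correlation on the raw scale with the correlation after taking the logarithm of price: if the relationship is roughly exponential, log price should be closer to linear in x. This is a minimal sketch, assuming ggplot2 is installed; the names `r_raw` and `r_log` are my own and do not appear in the original analysis.

```r
library(ggplot2)
data(diamonds)

# Pearson correlation on the raw scale and after log-transforming price
r_raw <- cor(diamonds$price, diamonds$x)
r_log <- cor(log10(diamonds$price), diamonds$x)
c(raw = r_raw, log10 = r_log)

# The same idea visually: a log-scaled y-axis should straighten the point cloud
ggplot(aes(x = x, y = price), data = diamonds) +
  geom_point(alpha = 0.05) +
  scale_y_log10()
```

If the log-scale correlation is noticeably higher than the raw one, that supports reading the curve as approximately exponential.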
# Between price and x
with(diamonds, cor.test(price, x, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: price and x
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8825835 0.8862594
## sample estimates:
## cor
## 0.8844352
# Between price and y
with(diamonds, cor.test(price, y, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: price and y
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8632867 0.8675241
## sample estimates:
## cor
## 0.8654209
# Between price and z
with(diamonds, cor.test(price, z, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: price and z
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8590541 0.8634131
## sample estimates:
## cor
## 0.8612494
Rounded to two decimal places, the correlations between price and x, y, and z are 0.88, 0.87, and 0.86.
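The three cor.test() calls above can be condensed when only the coefficients are needed. The sketch below assumes ggplot2 is installed; `dims` and `price_cor` are illustrative names, not part of the original code.

```r
library(ggplot2)
data(diamonds)

# Correlation of price with each dimension, rounded to two decimals
dims <- c("x", "y", "z")
price_cor <- sapply(dims, function(d) cor(diamonds$price, diamonds[[d]]))
round(price_cor, 2)
```

Unlike cor.test(), cor() returns only the coefficient, so the confidence intervals and p-values shown above are not reproduced.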
Create a simple scatter plot of price vs depth.
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
ggplot(aes(x = depth, y = price), data = diamonds) +
  geom_point()
Change the code to make the points 1/100 as opaque as they are now and to mark the x-axis every 2 units.
Hint 1: Use the alpha parameter in geom_point() to adjust the transparency of points.
Hint 2: Use scale_x_continuous() with the breaks parameter to adjust the x-axis.
ggplot(aes(x = depth, y = price), data = diamonds) +
  geom_point(alpha = 0.01) +
  scale_x_continuous(breaks = seq(0, 80, 2))
Most diamonds have a depth between 59 and 64 percent.
# Correlation of depth vs price
with(diamonds, cor.test(depth, price, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: depth and price
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.019084756 -0.002208537
## sample estimates:
## cor
## -0.0106474
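The correlation between depth and price is -0.01.
I wouldn’t use depth to predict the price of a diamond, because a correlation of -0.01 implies essentially no linear relationship.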
Create a scatterplot of price vs carat and omit the top 1% of price and carat values.
Both plots below show the same data on different scales: the first uses coord_cartesian() to zoom the axes without dropping any rows, while the second removes the top 1% of price and carat values with subset() before plotting.
ggplot(aes(carat, price), data = diamonds) +
  geom_point(alpha = 0.1) +
  coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)),
                  ylim = c(0, quantile(diamonds$price, 0.99)))
ggplot(aes(x = carat, y = price),
       data = subset(diamonds, diamonds$price < quantile(diamonds$price, 0.99) &
                       diamonds$carat < quantile(diamonds$carat, 0.99))) +
  geom_point(alpha = 0.1)
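The practical difference between the two approaches is that subset() changes the data while coord_cartesian() only changes the visible window. A small sketch, assuming ggplot2 is installed, counts how many rows the subset keeps; `price_cut`, `carat_cut`, and `trimmed` are names introduced here for illustration.

```r
library(ggplot2)
data(diamonds)

# Cutoffs at the 99th percentile of each variable
price_cut <- quantile(diamonds$price, 0.99)
carat_cut <- quantile(diamonds$carat, 0.99)

# Rows that survive the subset() approach
trimmed <- subset(diamonds, price < price_cut & carat < carat_cut)

# coord_cartesian() keeps every row and only changes the visible window
c(all_rows = nrow(diamonds), after_subset = nrow(trimmed))
```

This matters once a smoother or summary is added: with coord_cartesian() it is computed on all rows, with subset() only on the rows that remain.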
Create a scatterplot of price vs. volume (x * y * z). This is a very rough approximation for a diamond’s volume.
Create a new variable for volume in the diamonds data frame. This will be useful in a later exercise.
# Volume variable
diamonds$vol <- (diamonds$x * diamonds$y) * diamonds$z
ggplot(aes(x = vol, y = price),
       data = diamonds) +
  geom_point(alpha = 0.02) +
  coord_cartesian(xlim = c(40, quantile(diamonds$vol, 0.95)),
                  ylim = c(0, quantile(diamonds$price, 0.95))) +
  geom_smooth(color = 'red')
ggplot(aes(x = vol, y = price),
       data = diamonds) +
  geom_point() +
  geom_smooth(color = 'red')
There appears to be a positive correlation between volume and price. Once the volume exceeds 200, the smoothed line shows the price rising.
Diamonds with a volume above 200 are rare; the majority have a volume between 45 and 155.
There seem to be some overplotted vertical lines, probably because diamonds are cut to standard sizes and so cluster around certain volumes (e.g., around 125, 150, and 165).
There is one outlier with a volume near 4000, and a relatively cheap diamond with a volume near 900.
with(subset(diamonds, vol > 0 & vol < 800), cor.test(vol, price))
##
## Pearson's product-moment correlation
##
## data: vol and price
## t = 559.19, df = 53915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9222944 0.9247772
## sample estimates:
## cor
## 0.9235455
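The correlation between price and volume, excluding diamonds with a volume of 0 or of 800 and above, is 0.92 (0.9235455).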
Subset the data to exclude diamonds with a volume greater than or equal to 800. Also, exclude diamonds with a volume of 0. Adjust the transparency of the points and add a linear model to the plot. (Types of smoothers in ggplot2.)
stat_smooth() and geom_smooth() are effectively aliases of each other.
ggplot(aes(x = vol, y = price),
       data = subset(diamonds,
                     diamonds$vol < 800 & diamonds$vol > 0)) +
  geom_point(alpha = 0.02) +
  geom_smooth(color = 'red')
I don’t think it’s a useful model, because beyond a volume of about 400 the fitted price begins to drop, which I find unlikely given the strong correlation (0.92) between price and volume for volumes between 0 and 800.
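The exercise asks for a linear model, whereas geom_smooth() called without a method argument picks a default smoother (for a data set this large, a GAM rather than a straight line). The sketch below shows one way a straight-line fit could be requested, assuming ggplot2 is installed; `trimmed` and `fit` are names introduced here, and this is not necessarily the plot the author intended.

```r
library(ggplot2)
data(diamonds)

# Recreate the volume variable and drop volumes of 0 or of 800 and above
diamonds$vol <- diamonds$x * diamonds$y * diamonds$z
trimmed <- subset(diamonds, vol > 0 & vol < 800)

# Straight-line fit of price on volume
fit <- lm(price ~ vol, data = trimmed)
coef(fit)

# The same fit drawn on the scatterplot
ggplot(aes(x = vol, y = price), data = trimmed) +
  geom_point(alpha = 0.02) +
  geom_smooth(method = "lm", color = "red")
```

A straight line cannot bend downward at high volumes, so the drop described above would not appear in this version; whether a line is an adequate description of the data is a separate question.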
Use the dplyr package to create a new data frame containing info on diamonds by clarity.
Name the data frame diamondsByClarity.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
diamondsByClarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            n = n())
head(diamondsByClarity)
## # A tibble: 6 × 6
## clarity mean_price median_price min_price max_price n
## <ord> <dbl> <dbl> <int> <int> <int>
## 1 I1 3924.169 3344 345 18531 741
## 2 SI2 5063.029 4072 326 18804 9194
## 3 SI1 3996.001 2822 326 18818 13065
## 4 VS2 3924.989 2054 334 18823 12258
## 5 VS1 3839.455 2005 327 18795 8171
## 6 VVS2 3283.737 1311 336 18768 5066
We’ve created summary data frames with the mean price by clarity and color.
Your task is to write additional code to create two bar plots on one output image using the grid.arrange() function from the package gridExtra.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Summary data frames with mean price by clarity, color, and cut
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))
diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mp_by_cut <- summarise(diamonds_by_cut, mean_price = mean(price))
head(diamonds_mp_by_color)
## # A tibble: 6 × 2
## color mean_price
## <ord> <dbl>
## 1 D 3169.954
## 2 E 3076.752
## 3 F 3724.886
## 4 G 3999.136
## 5 H 4486.669
## 6 I 5091.875
head(diamonds_mp_by_clarity)
## # A tibble: 6 × 2
## clarity mean_price
## <ord> <dbl>
## 1 I1 3924.169
## 2 SI2 5063.029
## 3 SI1 3996.001
## 4 VS2 3924.989
## 5 VS1 3839.455
## 6 VVS2 3283.737
## Creating bar plots
p1 <- ggplot(aes(color, mean_price), data = diamonds_mp_by_color) +
  geom_bar(stat = 'identity')
p2 <- ggplot(aes(clarity, mean_price), data = diamonds_mp_by_clarity) +
  geom_bar(stat = 'identity')
p3 <- ggplot(aes(cut, mean_price), data = diamonds_mp_by_cut) +
  geom_bar(stat = 'identity')
grid.arrange(p1, p2, p3)
In price by color, the mean price goes up as the color gets worse.
In price by clarity, the SI2 clarity has the highest mean price and the VVS1 clarity has the lowest.
Mean price tends to decrease as clarity improves. The same can be said for color. This seems counterintuitive.
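One possible explanation, which the analysis above does not test, is that stones in the lower clarity grades tend to be larger, so size rather than clarity drives their mean price. The sketch below puts mean carat next to mean price for each clarity grade; it assumes ggplot2 and dplyr are installed, and `size_by_clarity` is a name introduced here.

```r
library(ggplot2)
library(dplyr)
data(diamonds)

# Mean carat next to mean price for each clarity grade
size_by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_carat = mean(carat),
            mean_price = mean(price),
            n = n())
size_by_clarity
```

If mean carat falls as clarity improves, comparing prices within a narrow carat range would be a fairer test of how clarity affects price.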
The Gapminder website contains over 500 data sets with information about the world’s population. Your task is to continue the investigation you did at the end of Problem Set 3.
If you’re feeling adventurous or want to try some data munging, see if you can find a data set or scrape one from the web.
In your investigation, examine pairs of variables and create 2-5 plots that make use of the techniques from Lesson 4.
library(tidyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
list.files()
## [1] "correlation_images.jpeg"
## [2] "E_2_Problem_Set_files"
## [3] "E_2_Problem_Set.html"
## [4] "E_2_Problem_Set.rmd"
## [5] "indicatorwdigdp_percapita_growth.xlsx"
## [6] "Investment.xlsx"
## [7] "lesson4_student.html"
## [8] "lesson4_student.rmd"
## [9] "pseudo_facebook.tsv"
# Reading the Excel files into RStudio
library('rio')
gdp = import('indicatorwdigdp_percapita_growth.xlsx')
invest = import('Investment.xlsx')
# Tidying data. GDI stands for Gross Domestic Investment
gdp = gather(gdp, "year", "GDP", 2:53)
invest = gather(invest, "year", "GDI", 2:53)
# Changing column names in both
colnames(gdp) <- c("country","year","GDP")
colnames(invest) <- c("country","year","GDI")
# Merging datasets by country and year, then dropping rows with missing values
data <- merge(gdp, invest, by = c("country","year"))
data <- filter(data, !is.na(data$GDI))
data <- filter(data, !is.na(data$GDP))
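The reshape-and-merge steps above can be hard to picture without seeing the data. The sketch below runs the same gather() and merge() pattern on two tiny tables; the country names and all values are made up for illustration and are not taken from the Gapminder files, and it assumes tidyr is installed.

```r
library(tidyr)

# Hypothetical wide tables: one row per country, one column per year
gdp_wide <- data.frame(country = c("A", "B"),
                       `1960` = c(1.2, -0.5),
                       `1961` = c(2.0, NA),
                       check.names = FALSE)
gdi_wide <- data.frame(country = c("A", "B"),
                       `1960` = c(20, 15),
                       `1961` = c(22, 18),
                       check.names = FALSE)

# Reshape the year columns (positions 2 to 3) into key/value pairs
gdp_long <- gather(gdp_wide, "year", "GDP", 2:3)
gdi_long <- gather(gdi_wide, "year", "GDI", 2:3)

# Join on country and year, then keep rows where both values are present
toy <- merge(gdp_long, gdi_long, by = c("country", "year"))
toy <- toy[!is.na(toy$GDP) & !is.na(toy$GDI), ]
toy
```

Each remaining row is one country-year pair with both indicators, which is the shape cor.test() and ggplot() need.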
# Seeing the correlation
with(data, cor.test(GDP, GDI))
##
## Pearson's product-moment correlation
##
## data: GDP and GDI
## t = 18.877, df = 6639, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2027515 0.2484058
## sample estimates:
## cor
## 0.2257026
# Plotting a scatter plot
ggplot(aes(x = GDI, y = GDP), data = data) +
  geom_point(alpha = 0.2) +
  geom_smooth()
# Understanding how the mean and median of GDP and GDI vary by country
data.country_by_gdi <- data %>%
  group_by(country) %>%
  summarise(gdi_mean = mean(GDI),
            gdi_median = median(GDI),
            gdp_mean = mean(GDP),
            gdp_median = median(GDP),
            n = n()) %>%
  arrange(country)
p1 <- ggplot(aes(x = gdp_mean, y = gdi_mean),
             data = data.country_by_gdi) +
  geom_line()
p2 <- ggplot(aes(x = gdp_median, y = gdi_median),
             data = data.country_by_gdi) +
  geom_line()
grid.arrange(p1, p2, ncol = 1)
# Plotting a more in-depth scatter plot, with a line plot overlaid
ggplot(aes(x = gdi_mean, y = gdp_mean), data = data.country_by_gdi) +
  geom_point(alpha = 0.2) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color = 'red') +
  geom_smooth()
I investigated the relationship between GDP growth and gross domestic investment (GDI).
The correlation between GDP and GDI is 0.2257026.
The first scatter plot indicates that most GDI values lie between 0 and 40, and most GDP growth values between -15 and +15. GDI seemingly has little influence on GDP growth, which is both negative and positive across the whole range of GDI. There are also a few anomalies, such as a GDI of 40 with a GDP growth of 75, or a GDI of 120 with a GDP growth of 25.
As the second graph shows, the GDP and GDI means and medians vary greatly from country to country.
The overlaid scatterplot shows that the GDP and GDI means have a positive correlation.
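Squaring the correlation coefficient gives a rough sense of how weak this relationship is. The short calculation below uses the value reported by cor.test() above; `r` and `r_squared` are names introduced here.

```r
# Correlation reported by cor.test() above
r <- 0.2257026

# Share of variance in one variable linearly associated with the other
r_squared <- r^2
round(r_squared, 3)
# 0.051
```

In other words, a linear relationship with GDI accounts for only about 5% of the variation in GDP growth, which is consistent with the scattered appearance of the plot.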